Rationale - the Data lake

  • The GRINS foundation aims to implement a data platform for knowledge transfer and statistical analysis (AMELIA)

  • Raw material of the platform: the data lake

    • Broad repository hosting several categories of administrative data from different sources
    • Available to private, corporate and academic users
    • Data organised at the territorial level of municipalities (LAU/NUTS-4)

If you are viewing this document as a PDF, it can also be opened in its HTML version

Scope

  • The present R package is intended as one of several contributions to the data lake

  • This package covers the dimension of public education, with special regard to the territorial structure of the education system.

  • Main utility: analysing territorial disparities in education quality and school infrastructure endowment

Principles followed

  • Accessibility: All data must be publicly accessible and easy to handle for the generic user
    • Input data are open and come from publicly accessible web pages
  • Updating: All information is retrieved in real time so that it is up to date
    • Inputs are scraped from the web rather than stored in a built-in repository
  • Portability: All objects should be easy to export and process with different software tools:
    • We work in the tidyverse framework, and all outputs are structured as tibbles
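
As a minimal sketch of the portability principle, the snippet below writes a small table to CSV and reads it back using base R only. The table and its column names are invented for illustration and are not taken from the package; a tibble could be passed to write.csv() in the same way, since tibbles are data frames.

```r
# Toy table standing in for a package output (codes and values are invented)
toy_output <- data.frame(
  Municipality_code = c("001001", "001002", "001003"),
  n_schools = c(4L, 1L, 7L),
  stringsAsFactors = FALSE
)

# Export to CSV, a format that most statistical software can read
csv_path <- tempfile(fileext = ".csv")
write.csv(toy_output, csv_path, row.names = FALSE)

# Read it back, forcing the codes to character so leading zeros survive
reimported <- read.csv(csv_path,
                       colClasses = c(Municipality_code = "character"))
```

Reading territorial codes as character is a deliberate choice here: left to type guessing, a code such as "001001" would be parsed as the number 1001.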

Main function modules

  • Get_: input data scraping. Information is not altered, and the user receives a data set as close as possible to what the provider releases

  • Util_: utilities; mainly data modification and editing

  • Group_: data aggregation at the relevant territorial level

    • NUTS-3/Province
    • LAU/Municipality
  • Map_: data visualisation

    • Static maps (vector format): easy to export
    • Interactive maps: preserve information on different variables
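
The idea behind the Group_ step can be sketched as a plain aggregation from school-level rows to a territorial unit. The example below uses base R and invented toy data; the column names and the choice of the mean as summary are assumptions for illustration, and the package functions may summarise variables differently.

```r
# Toy school-level data (invented codes and values)
schools <- data.frame(
  School_code = c("S1", "S2", "S3", "S4"),
  Municipality_code = c("A", "A", "B", "B"),
  Province_code = c("P1", "P1", "P1", "P1"),
  Surface = c(1200, 800, 1500, 700),
  stringsAsFactors = FALSE
)

# LAU level: one row per municipality, averaging over its schools
by_municipality <- aggregate(Surface ~ Municipality_code,
                             data = schools, FUN = mean)

# NUTS-3 level: one row per province, averaging over its schools
by_province <- aggregate(Surface ~ Province_code,
                         data = schools, FUN = mean)
```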

Main datasets

  • Data from the Ministry of Education
    • Includes:
      • National Schools Registry
      • School Buildings database
      • Students and teachers counts
    • Mainly available at the school level (except for the count of teachers)
  • Ultra-Broadband implementation
    • Available at the school level
  • Invalsi census survey
    • Available at the NUTS-3 / LAU level

Schools Taxonomy

  • Schools ID - mechanographical codes
    • Most complete list: National Schools Registry
    • Identifies both school order and address (of high schools)
  • School buildings ID - typically numeric codes
    • Only included in the School buildings DB

School buildings database

  • Main source of information regarding the school infrastructure

  • Mostly includes categorical variables, regarding several aspects such as:

    • Environmental context
    • Reachability by public or private transport
    • Building period
    • Surfaces and volumes

School buildings database

  • Functions:
    • Get_DB_MIUR(): scrape the raw data
    • Util_DB_MIUR_num(): convert variables to numeric and edit raw data if required
    • Group_DB_MIUR(): harmonise at the territorial level
    • Map_School_Buildings(): render
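library(dplyr)  # provides %>% and the .data pronoun
# Scrape the raw data, then convert variables to numeric
Input_DB23_MIUR <- Get_DB_MIUR(Year = 2023, input_Registry = Registry23)
DB23_MIUR_n <- Input_DB23_MIUR %>% Util_DB_MIUR_num()
# Aggregate to municipalities and add the log of the school area surface
DB23_MIUR <- Group_DB_MIUR(DB23_MIUR_n, InnerAreas = FALSE)$Municipality_data %>%
  dplyr::mutate(log_Surface = log(.data$School_area_surface))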

School buildings database

Output of Util_DB_MIUR_num():

School buildings database

Output of Group_DB_MIUR()

School buildings database

DB23_MIUR %>% 
  Map_School_Buildings(input_shp = Mun22_shp, field = "log_Surface", 
                       level = "LAU", order = "Middle",
                       region_code = c(15:18), verbose = FALSE)

Students per class

  • Potential outliers: the user can set acceptance bounds on the school-level average class size (arguments UB_nstud_byclass and LB_nstud_byclass)
nstud23 <- Get_nstud(2023, verbose = FALSE)
nstud23_byClass <- nstud23 %>% Util_nstud_wide(UB_nstud_byclass = 45)
## Filtered out 5 schools with less than 1  or more than 45 students per class
nstud23_mun <- Group_nstud(nstud23_byClass, 
  input_Registry = Registry23, input_School2mun = School2mun23, 
  verbose = FALSE)$Municipality_data
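
The acceptance bounds on average class size can be illustrated on toy data. This is only a sketch of the filtering logic in base R, with invented values and hypothetical column names; it is not the package's implementation.

```r
# Toy school-level counts (invented values)
toy <- data.frame(
  School_code = c("S1", "S2", "S3", "S4"),
  n_students = c(120, 300, 10, 95),
  n_classes = c(6, 6, 20, 5),
  stringsAsFactors = FALSE
)

# Hypothetical bounds playing the role of LB_nstud_byclass / UB_nstud_byclass
LB <- 1
UB <- 45

# Average class size per school, then keep only schools within the bounds
toy$students_per_class <- toy$n_students / toy$n_classes
keep <- toy$students_per_class >= LB & toy$students_per_class <= UB
filtered <- toy[keep, ]

message("Filtered out ", sum(!keep), " schools with less than ", LB,
        " or more than ", UB, " students per class")
```

In this toy case S2 (50 students per class) and S3 (0.5 students per class) fall outside the bounds and are dropped.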

Students counts

Ultra-Broadband activation

  • National Ultra-Broadband plan (2020): covers about 35,000 schools across the national territory (the vast majority)
  • The plan was expected to be completed by the end of 2023; however, works are still in progress for a number of schools
  • By definition, an Ultra-Broadband connection has a minimum guaranteed speed of 100 megabits/second up to the peering point, with a maximum of 1 gigabit/second

Ultra-Broadband activation

BB24 <- Get_BroadBand(Date = as.Date("2024-01-01"), verbose = FALSE) %>% 
  dplyr::filter(.data$Order == "High") %>% 
  dplyr::group_by(.data$Region_description) %>% 
  dplyr::summarise(Status = mean(.data$BB_Activation_status)) 
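
Taking the mean of the activation status by region yields a share of activated schools, provided the status is coded as logical or 0/1. That coding is an assumption here, not something the slides confirm. A base R sketch on invented toy data:

```r
# Toy school-level activation data (invented regions and values)
toy_bb <- data.frame(
  Region_description = c("North", "North", "North", "South", "South"),
  BB_Activation_status = c(TRUE, TRUE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# The mean of a logical vector is the proportion of TRUE values,
# so this gives the share of activated schools per region
share <- aggregate(BB_Activation_status ~ Region_description,
                   data = toy_bb, FUN = mean)
```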

Invalsi census survey

  • Aggregate measure of students' skills, expressed as the territorial average of either:
    • Percentage of sufficient tests (primary schools only)
    • Ability of the \(i\)-th student (\(A_i\)) to correctly answer question \(j\) of difficulty \(D_j\), where \(Q_{ij} = 1\) denotes a correct answer, based on the Rasch model \[\mathrm{Prob} \lbrace Q_{ij} = 1 \rbrace = \frac{e^{A_i - D_j}}{1 + e^{A_i - D_j}}\]
  • Spatially homogeneous indicator
  • Three variable prefixes: M_: mean; S_: standard deviation; C_: coverage
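
The response model above can be evaluated directly. The function below is a small illustration written for this purpose (its name is hypothetical and not part of the package); it is the logistic function applied to the gap between ability and difficulty, so stats::plogis(A - D) gives the same value.

```r
# Probability of a correct answer given ability A and difficulty D
rasch_prob <- function(A, D) {
  exp(A - D) / (1 + exp(A - D))
}

# When ability equals difficulty, the probability is exactly one half
p_equal <- rasch_prob(A = 0, D = 0)   # 0.5

# A more able student facing the same question is more likely to succeed
p_low  <- rasch_prob(A = -1, D = 0)
p_high <- rasch_prob(A = 1, D = 0)
```

Only the difference \(A_i - D_j\) matters, so shifting ability and difficulty by the same amount leaves the probability unchanged.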

Invalsi census survey

Map_Invalsi(input_shp = Prov22_shp, grade = 13, subj = "MAT",
  Year = 2023, level = "NUTS-3", main = "Maths score",
  pal = "viridis", verbose = FALSE)